The Zero Resource Speech Challenge 2017
We describe a new challenge aimed at discovering subword and word units from
raw speech. This challenge is the follow-up to the Zero Resource Speech
Challenge 2015. It aims at constructing systems that generalize across
languages and adapt to new speakers. The design features and evaluation metrics
of the challenge are presented and the results of seventeen models are
discussed. Comment: IEEE ASRU (Automatic Speech Recognition and Understanding) 2017, Okinawa, Japan.
Learning spectro-temporal representations of complex sounds with parameterized neural networks
Deep Learning models have become potential candidates for auditory
neuroscience research, thanks to their recent successes on a variety of
auditory tasks. Yet, these models often lack interpretability to fully
understand the exact computations they perform. Here, we propose a parameterized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels (Learnable STRFs) and that is fully interpretable. We evaluated the predictive capabilities of this layer on Speech
Activity Detection, Speaker Verification, Urban Sound Classification and Zebra
Finch Call Type Classification. We found that models based on Learnable STRFs are on par with the different toplines for all tasks, and obtain the best
performance for Speech Activity Detection. As this layer is fully
interpretable, we used quantitative measures to describe the distribution of
the learned spectro-temporal modulations. The filters adapted to each task and
focused mostly on low temporal and spectral modulations. The analyses show that
the filters learned on human speech have similar spectro-temporal parameters as
the ones measured directly in the human auditory cortex. Finally, we observed
that the tasks organized themselves in a meaningful way: the human vocalization tasks lay close to each other, while the bird vocalization task was far from both the human vocalization and urban sound tasks.
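As an illustration of the underlying idea (a minimal numpy sketch, not the paper's implementation), a spectro-temporal Gabor kernel of the kind a Learnable STRF parameterizes is a Gaussian envelope multiplied by a sinusoidal carrier; the modulation rates and envelope widths are the quantities such a layer would learn:

```python
import numpy as np

def gabor_strf(n_time, n_freq, temporal_mod, spectral_mod, sigma_t=1.0, sigma_f=1.0):
    """2D Gabor kernel on a normalized time-frequency grid: a Gaussian
    envelope times a cosine carrier whose frequencies set the temporal
    and spectral modulation rates."""
    t = np.linspace(-1, 1, n_time)
    f = np.linspace(-1, 1, n_freq)
    T, F = np.meshgrid(t, f, indexing="ij")
    envelope = np.exp(-(T ** 2) / (2 * sigma_t ** 2) - (F ** 2) / (2 * sigma_f ** 2))
    carrier = np.cos(2 * np.pi * (temporal_mod * T + spectral_mod * F))
    return envelope * carrier
```

Convolving a spectrogram with a bank of such kernels at different modulation rates yields a spectro-temporal modulation representation.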
Sampling strategies in Siamese Networks for unsupervised speech representation learning
Recent studies have investigated siamese network architectures for learning
invariant speech representations using same-different side information at the
word level. Here we investigate systematically an often ignored component of
siamese networks: the sampling procedure (how pairs of same vs. different
tokens are selected). We show that sampling strategies taking into account
Zipf's Law, the distribution of speakers and the proportions of same and
different pairs of words significantly impact the performance of the network.
In particular, we show that word frequency compression improves learning across
a large range of variations in number of training pairs. This effect does not
apply to the same extent to the fully unsupervised setting, where the pairs of
same-different words are obtained by spoken term discovery. We apply these
results to pairs of words discovered using an unsupervised algorithm and show
an improvement on state-of-the-art in unsupervised representation learning
using siamese networks. Comment: Conference paper at Interspeech 2018.
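The frequency-compression idea can be sketched as follows (an illustrative toy, not the paper's code): sample word types with probability proportional to freq ** alpha, where alpha = 1 reproduces the raw Zipfian distribution, alpha = 0 samples types uniformly, and intermediate values compress word frequency:

```python
import random
from collections import Counter

def compressed_sampler(tokens, alpha=0.5, seed=0):
    """Return a function that samples 'same' pairs of word types with
    probability proportional to freq ** alpha (frequency compression)."""
    counts = Counter(tokens)
    types_ = list(counts)
    weights = [counts[w] ** alpha for w in types_]
    rng = random.Random(seed)

    def sample_same_pair():
        w = rng.choices(types_, weights=weights, k=1)[0]
        return (w, w)  # two tokens of the same type form a "same" pair

    return sample_same_pair
```

With a heavily Zipfian corpus, lowering alpha lets rare words contribute pairs far more often than their raw frequency would allow.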
Shennong: a Python toolbox for audio speech features extraction
We introduce Shennong, a Python toolbox and command-line utility for speech features extraction. It implements a wide range of well-established state-of-the-art algorithms, including spectro-temporal filters such as Mel-Frequency Cepstral Filterbanks or Predictive Linear Filters, pre-trained neural networks and pitch estimators, as well as speaker normalization methods and post-processing algorithms. Shennong is an open-source, easy-to-use, reliable and extensible framework. The use of Python makes integration with other speech modeling and machine learning tools easy. It aims to replace or complement several heterogeneous software tools, such as Kaldi or Praat. After describing the Shennong software architecture, its core components and implemented algorithms, this paper illustrates its use on three applications: a comparison of the performance of speech features on a phone discrimination task, an analysis of a Vocal Tract Length Normalization model as a function of the speech duration used for training, and a comparison of pitch estimation algorithms under various noise conditions.
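As background for one of the front ends mentioned above (a generic numpy sketch, not Shennong's API), a triangular mel filterbank of the kind used in MFCC computation spaces filters evenly on the mel scale and maps them back to FFT bins:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale, applied to the
    magnitude spectrum when computing mel-frequency cepstral features."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    mel_points = np.linspace(low, high, n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising edge of the triangle
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling edge of the triangle
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank
```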
Learning Word Embeddings: Unsupervised Methods for Fixed-size Representations of Variable-length Speech Segments
Fixed-length embeddings of words are very useful for a variety of tasks in speech and language processing. Here we systematically explore two methods of computing fixed-length embeddings for variable-length sequences. We evaluate their susceptibility to phonetic and speaker-specific variability on English, a high-resource language, and Xitsonga, a low-resource language, using two evaluation metrics: ABX word discrimination and ROC-AUC on same-different phoneme n-grams. We show that a simple downsampling method supplemented with length information can outperform the variable-length input feature representation on both evaluations. Recurrent autoencoders, trained without supervision, can yield even better results at the expense of increased computational complexity.
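The downsampling method can be sketched as follows (a minimal illustration under assumed details: evenly spaced interpolation points over the frames, with the original segment length appended as the extra dimension of information):

```python
import numpy as np

def downsample_embedding(features, n_samples=10, add_length=True):
    """Fixed-size embedding of a variable-length (T, D) feature matrix:
    interpolate each feature dimension at n_samples evenly spaced times,
    flatten, and optionally append the original length T."""
    T, D = features.shape
    positions = np.linspace(0, T - 1, n_samples)
    idx = np.arange(T)
    sampled = np.stack(
        [np.interp(positions, idx, features[:, d]) for d in range(D)], axis=1
    )
    emb = sampled.ravel()
    if add_length:
        emb = np.concatenate([emb, [float(T)]])
    return emb
```

Segments of any duration map to the same dimensionality, so they can be compared directly with cosine or Euclidean distance.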
A K-nearest neighbours approach to unsupervised spoken term discovery
Unsupervised spoken term discovery is the task of finding recurrent acoustic patterns in speech without any annotations. Current approaches consist of two steps: (1) discovering similar patterns in speech, and (2) partitioning those pairs of acoustic tokens using graph clustering methods. We propose a new approach for the first step. Previous systems used various approximation algorithms to make the search tractable on large amounts of data. Our approach is based on an optimized k-nearest neighbours (KNN) search coupled with a fixed word embedding algorithm. The results show that the KNN algorithm is robust across languages, consistently outperforms the DTW-based baseline, and is competitive with current state-of-the-art spoken term discovery systems.
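A brute-force version of the KNN step can be sketched as follows (illustrative only: the system described above uses an optimized search, whereas this is an O(n^2) scan over fixed-size segment embeddings):

```python
import numpy as np

def knn_pairs(embeddings, k=5):
    """Return candidate (query, neighbour) index pairs by cosine
    similarity over fixed-size segment embeddings; downstream steps
    would score and cluster these pairs into discovered terms."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize -> cosine
    sims = X @ X.T
    np.fill_diagonal(sims, -np.inf)                   # exclude self-matches
    neighbours = np.argsort(-sims, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(X)) for j in neighbours[i]]
```

An optimized KNN index (e.g. a tree or graph-based index) replaces the dense similarity matrix to scale this to large speech corpora.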
Comparison of actual and time-optimized flight trajectories in the context of the In-service Aircraft for a Global Observing System (IAGOS) programme
Airlines optimize flight trajectories in order to minimize their operational costs, of which fuel consumption is a large contributor. It is known that flight trajectories are not fuel-optimal because of airspace congestion and restrictions, safety regulations, bad weather and other operational constraints. However, the extent to which trajectories are not fuel-optimal (and therefore CO2-optimal) is not well known. In this study we present two methods for optimizing the flight cruising time by taking best advantage of the wind pattern at a given flight level and for constant airspeed. We test these methods against actual flight trajectories recorded under the In-service Aircraft for a Global Observing System (IAGOS) programme. One method is more robust than the other (computationally faster) method, but when successful the two methods agree very well with each other, with optima generally within the order of 0.1% of each other. The IAGOS actual cruising trajectories are on average 1% longer than the computed optimum for the transatlantic route, which leaves little room for improvement given that, by construction, the actual trajectory cannot be better than our optimum. The average degree of non-optimality is larger for some other routes and can be up to 10%. On some routes there are also outlier flights that are not well optimized; however, the reason for this is not known.
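The core quantity in such wind-aware cruising-time optimization, sketched here under the standard crab-angle assumption (not the paper's code), is the along-track ground speed of a segment at constant airspeed:

```python
import math

def segment_time(distance, airspeed, wind_along, wind_cross):
    """Time to fly a segment at constant airspeed: the aircraft crabs
    into the crosswind, so the along-track ground speed is
    sqrt(airspeed^2 - wind_cross^2) + wind_along."""
    ground_speed = math.sqrt(airspeed ** 2 - wind_cross ** 2) + wind_along
    return distance / ground_speed
```

A tailwind component shortens the segment time, a headwind lengthens it, and even a pure crosswind costs time because part of the airspeed is spent holding the track; summing segment times over candidate paths gives the quantity being minimized.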
DP-Parse: Finding Word Boundaries from Raw Speech with an Instance Lexicon
Finding word boundaries in continuous speech is challenging as there is little or no equivalent of a 'space' delimiter between words. Popular Bayesian non-parametric models for text segmentation (Goldwater et al., 2006, 2009) use a Dirichlet process to jointly segment sentences and build a lexicon of word types. We introduce DP-Parse, which uses similar principles but relies only on an instance lexicon of word tokens, avoiding the clustering errors that arise with a lexicon of word types. On the Zero Resource Speech Benchmark 2017, our model sets a new speech segmentation state-of-the-art in 5 languages. The algorithm monotonically improves with better input representations, achieving yet higher scores when fed with weakly supervised inputs. Despite lacking a type lexicon, DP-Parse can be pipelined to a language model and learn semantic and syntactic representations as assessed by a new spoken word embedding benchmark.
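The segmentation machinery shared by such models can be sketched as a generic dynamic program (illustrative only: DP-Parse's actual scoring comes from an instance lexicon over speech embeddings, and the score function here is a stand-in for how word-like a span is):

```python
def dp_segment(sequence, score, max_len=5):
    """Dynamic-programming segmentation: best[i] holds the highest total
    score over all segmentations of sequence[:i]; back-pointers recover
    the chosen boundaries as (start, end) index pairs."""
    n = len(sequence)
    best = [0.0] + [float("-inf")] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            cand = best[j] + score(sequence[j:i])
            if cand > best[i]:
                best[i] = cand
                back[i] = j
    bounds, i = [], n          # walk the back-pointers to recover boundaries
    while i > 0:
        bounds.append((back[i], i))
        i = back[i]
    return bounds[::-1]
```

On text this recovers known words exactly; on speech the same recursion runs over candidate boundaries in the acoustic signal.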